In this project you should use Python 3, Jupyter Notebook and Scikit-learn. You are also allowed to use Orange3.
The dataset to be analysed is ModifiedHousePrices.csv, a modified version of the train dataset used in Kaggle's competition "House Prices: Advanced Regression Techniques".
If you ask a home buyer to describe their dream house, they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With more than 70 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
This project challenges you twice by asking you to tackle both a regression task and a classification task on the same data.
The variables are described here.
The targets are:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict in the regression task (same as in the Kaggle challenge).

Price3Classes - the price category, where the price can be below 200000 ("<200000"), between 200000 and 400000 ("[200000,400000]"), or above 400000 (">400000"). This is the target variable that you're trying to predict in the classification task.

In this notebook the original dataset will be analysed, pre-processed, and exported.
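For orientation, the class label can be reproduced from SalePrice with pandas. A minimal sketch on toy values; the exact handling of prices falling exactly on 200000 or 400000 is an assumption:
# Hypothetical reconstruction of Price3Classes from the bin edges above
import pandas as pd
prices = pd.Series([150000, 250000, 450000])
pd.cut(prices, bins=[0, 200000, 400000, float('inf')],
       labels=['<200000', '[200000,400000]', '>400000'])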
# Import libraries
import pandas as pd
import os
from matplotlib import pyplot as plt
import numpy as np
from imblearn.combine import SMOTETomek
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
import plotly.graph_objs as go
import plotly.io as pio
pio.templates.default = "gridon"
from plotly.subplots import make_subplots
# Import local modules
from pp_functions import *
#IPython Configs
%matplotlib inline
%load_ext autoreload
%autoreload 2
# Get data_path
path = get_path()
# Load Dataset
df = pd.read_csv(os.path.join(path,"ModifiedHousePrices.csv"))
df.drop('Id', axis=1, inplace=True)
df.head()
# Dataset Profiling
profile(df)
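profile comes from the local pp_functions module. A minimal sketch of the kind of overview it might produce, assuming it reports per-feature dtypes, missing-value ratios and zero ratios (the quantities the pipeline below relies on):
# Hypothetical stand-in for the local profile helper
def profile_sketch(frame):
    return pd.DataFrame({
        'dtype': frame.dtypes,                 # feature data type
        'missing_ratio': frame.isna().mean(),  # fraction of missing cells
        'zero_ratio': frame.eq(0).mean(),      # fraction of zero cells
    })

profile_sketch(df).head()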
In a real EDA, every feature should be analysed individually and improved where possible. Here, considering the sheer size of the dataset and the time-frame available, that is not feasible. This problem will therefore be treated as a big data problem, and the dataset will be processed with a general pipeline; it may not be the optimal solution in every case, but it is an acceptable one, and the only one possible in the time available to complete this assignment.
Nevertheless, some attention was given to details, and checks were made to ensure the quality of all features retained.
It is also important to keep in mind that classification and regression are two very different tasks; in a real scenario this would also need to be considered when performing feature engineering. For example, for classification, a mix of numerical and categorical variables will probably produce a worse result than binning all numerical variables, whereas for regression the numerical features are very important.
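As a minimal sketch of such binning (not the pipeline actually used here), scikit-learn's KBinsDiscretizer maps each numerical value to a bin index:
# Hypothetical example of binning numerical values for classification
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
toy = np.array([[50000.], [120000.], [300000.], [450000.]])
binner.fit_transform(toy)  # each value mapped to a bin index 0..2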
# Separate Features from targets
labels = ['SalePrice', 'Price3Classes']
targets = df[labels]
df.drop(labels, axis=1, inplace=True)
In a real case scenario, most poor-quality features could be improved using different methods, such as transformation into Boolean features.
The Alley feature is an example: it indicates the type of alley entrance, but most houses do not have an alley, so there is a lot of missing data. It could be transformed to 0 or 1 (does not have an alley, has an alley), which could be useful to distinguish high-end houses. However, this feature would probably not be selected in a feature selection process, since it has low variance: with over 90% of houses lacking an alley, it does nothing to distinguish between them.
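As a sketch, that transformation is a one-liner in pandas, assuming (as in the original Kaggle data description) that a missing Alley value means no alley access:
# Hypothetical Boolean version of Alley: 1 if the house has an alley entrance, 0 otherwise
has_alley = df['Alley'].notna().astype(int)
has_alley.value_counts(normalize=True)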
For this reason, all features with missing data or zeros in more than 80% of data points will be discarded, since they provide little information.
When zeros or missing data account for between 30% and 80% of a feature's cells, that feature will be transformed into a binary feature, since this mostly happens for the presence or absence of a given characteristic.
In the remaining cases, zeros can be kept, since they do not bias the classifiers, and some non-binary features need to be kept to give the models discrimination power.
Finally, it is verified whether any feature has the same value in 80% or more of its cells; this way, features that will not improve discrimination between properties are eliminated. This step could be skipped, since it affects classification of the minority class (high-end housing).
This pipeline ensures good and fast data pre-processing, although in some cases it is not optimal.
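feature_transform lives in pp_functions. A minimal sketch of the thresholding rules just described, under the assumption that it returns the cleaned frame, the deleted columns and the names of the remaining categorical columns:
# Hypothetical sketch of the rules above; the real feature_transform may differ in details
def feature_transform_sketch(frame):
    frame = frame.copy()
    deleted = []
    for col in list(frame.columns):
        bad = frame[col].isna() | frame[col].eq(0)   # missing or zero cells
        if bad.mean() > 0.80:                        # too sparse: discard
            deleted.append(frame.pop(col))
        elif bad.mean() > 0.30:                      # mostly presence/absence: binarise
            frame[col] = (~bad).astype(int)
    for col in list(frame.columns):                  # drop single-value-dominated features
        if frame[col].value_counts(normalize=True, dropna=False).iloc[0] >= 0.80:
            deleted.append(frame.pop(col))
    cat_cols = frame.select_dtypes(include='object').columns.tolist()
    return frame, pd.concat(deleted, axis=1) if deleted else pd.DataFrame(), cat_cols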
df, _del, cat_cols = feature_transform(df)
_del.head()
Most of the deleted features could be transformed into binary features, and some might even be related to the target, but due to their low variance in this dataset they would not help the models perform better.
As was said previously, numerical features are very important: most classification and regression models achieve better performance with this data type, with tree models being the exception.
The combination of numerical and categorical data normally leads to poor performance. So a pipeline was implemented to transform this dataset into a fully continuous one, combining two techniques: encoding of the categorical features, followed by an autoencoder that learns a compact continuous representation.
This pipeline generates continuous features, ensuring a uniform dataset with low dimensionality (since autoencoders also serve as dimensionality reduction tools) and reduced noise.
Of course, this type of dataset shifts the models expected to perform best towards models designed for numerical inputs.
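encoder and generate_representation come from pp_functions. A minimal sketch of what generate_representation might do internally, assuming a simple fully connected autoencoder built with tensorflow.keras (the real architecture may differ):
# Hypothetical autoencoder behind generate_representation
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def autoencoder_sketch(X, encoding_dim=10, epochs=1000):
    inputs = Input(shape=(X.shape[1],))
    encoded = Dense(encoding_dim, activation='relu')(inputs)    # bottleneck layer
    decoded = Dense(X.shape[1], activation='linear')(encoded)   # reconstruction layer
    autoencoder = Model(inputs, decoded)   # trained to reproduce its input
    encoder_net = Model(inputs, encoded)   # extracts the low-dimensional codes
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit(X, X, epochs=epochs, verbose=0)
    return encoder_net.predict(X)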
X = encoder(df, cat_cols)
pd.DataFrame(X).head()
# To train the NN
#X_encoded = generate_representation(X, train=True, encoding_dim=10, epochs=1000)
# Load Model from saved files
X_encoded = generate_representation(X, train=False)
pd.DataFrame(X_encoded).head()
In the end, 10 features are going to be used. These features are learned representations of the original dataset.
The loss is satisfactory; nevertheless, there is room for improvement, which could be achieved by increasing the training time and by some hyperparameter tuning.
# Plot Target Again
make_bar_chart(targets['Price3Classes'])
There is a considerable imbalance between the classes.
To solve this issue, a mixture of over-sampling and under-sampling will be applied: SMOTE followed by Tomek Links.
SMOTE generates new data points in the minority classes by interpolating between points that already exist in the dataset; the Tomek Links step then cleans the data by removing pairs of nearest neighbours that belong to different classes. This allows the models to discriminate between the classes better, since ambiguous data points are removed.
# Apply under- and over-sampling
smt = SMOTETomek()
X_class, y_class = smt.fit_resample(X_encoded, targets['Price3Classes'])
# Plot Target Again
make_bar_chart(y_class)
print('Dataset has now {}(+{}) datapoints.'.format(len(X_class), len(X_class) - len(X_encoded)))
# Encode Target
le = LabelEncoder()
y_class = le.fit_transform(y_class)
set(y_class)
There was a big increase in data points (they more than doubled), which will increase bias significantly.
fig = make_subplots(rows=1, cols=2)
fig.add_trace(go.Histogram(x=targets['SalePrice']),row=1,col=1)
fig.add_trace(go.Box(y=targets['SalePrice'],boxpoints='all',line_color='orange'),row=1,col=2)
fig.update_layout(height=500, showlegend=False,title_text="Sale Price Distribution and Box Plot")
fig.show()
In case Plotly is not configured in your environment, here is a snapshot.
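If a static image is preferred, Plotly can also render one directly, assuming the kaleido package is installed:
# Optional fallback: render the figure as a static PNG instead of an interactive widget
fig.show(renderer="png")  # requires kaleido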
# prepare dataframes
balanced_df = pd.DataFrame(X_class)
balanced_df['target'] = y_class
normal_df = pd.DataFrame(X_encoded)
normal_df['target'] = targets['SalePrice']
# export dataframes
balanced_df.to_csv(os.path.join(path,"balanced_df.csv"))
normal_df.to_csv(os.path.join(path,"normal_df.csv"))